Abstract

The purpose of this time series analysis is to apply modern statistical methods to sea surface temperature (SST) data, including real-world approaches to incomplete data and model selection. Given the seasonal, cyclical nature of the data, interpolation was deemed the most appropriate technique for the missing data problem. The SST data was cleaned and formatted so that our team could perform exploratory analysis to identify basic seasonal or time trends. The series was further decomposed to expose any underlying trends that may not be obvious upon initial observation. Autocorrelations and partial autocorrelations were then generated to assess stationarity and to determine the strength of any correlation. Three models were constructed: a benchmark linear model, a traditional Holt-Winters time series model, and a functional model. The Holt-Winters model fit the series best, with the smallest mean squared error and absolute error. The time and effort that went into developing a complex functional model may not have been warranted, as it produced results similar to the traditional time series method.

Introduction

The purpose of this report is to document our team’s collaborative efforts in examining sea surface temperatures (SST). The objective of this project was to explore modern methodologies for dealing with missing data as well as forecasting techniques. Overall, we analyzed the performance of three statistical models: a benchmark model, a traditional time series model, and a functional model. We graded and compared each model’s performance based on the variance of the residuals it produces. Before these steps could be taken, however, our team first addressed the problem of missing data. Many avenues were explored for handling this obstacle; since we are working with climatological data exhibiting a seasonal trend, we concluded that interpolation was the appropriate choice. We initially received only a portion of the data to train our models, and later the full data set to test how well our models performed. This report concisely documents our thought process and statistical procedures.

The Data

The data used in this analysis was provided by the National Oceanic and Atmospheric Administration (NOAA) and is believed to have been measured within the region of the south Pacific. The data received at first consisted of just two variables: a date variable (month, year) and a monthly average sea surface temperature in degrees Celsius. The first record was taken in January 1950 and the data spans to December 2014, for a total of 780 observations. However, 137 of these observations were missing the sea surface temperature for the given time stamp. As we delved into our assignment and our analysis structure matured, we later received the complete data set containing all data points, which we then applied to our three modeling techniques.

Examining Missing Data

After researching different missing data techniques applied in time series and consulting our project advisor, we concluded that seasonal interpolation was the most applicable and appropriate technique, given the cyclical and seasonal nature of the series.

We also took several steps in cleaning, wrangling, and preparing the data before running any preliminary analysis. First, we created a season variable (winter, spring, summer, fall) to look at any seasonal trends that appeared relevant in the data over time. We also created a decade variable indicating the decade in which each monthly average was recorded, to compare trends in the data across decades. We repeated this procedure to create separate month and year variables (apart from the month-year variable provided).
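
The derived variables can be computed directly from the date. A minimal sketch in Python (the original wrangling was done in R, and the exact month-to-season mapping is our assumption, since the report does not state which months the team assigned to each season):

```python
def season(month):
    # Hypothetical month-to-season mapping; the report does not state
    # exactly which months were assigned to each season.
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"

def decade(year):
    # e.g. 1957 -> 1950, 2014 -> 2010
    return (year // 10) * 10
```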

Missing Data

Decade  Winter  Spring  Summer  Fall  Total
1950       5       7       0      6     18
1960       4       3       8      6     21
1970       3       6       5      7     21
1980       2       2       9      4     17
1990       4       4      10      8     26
2000       4       9       4      5     22
2010       2       4       1      5     12
Total     24      35      37     41    137

Missing Data Interpolations

We had three basic approaches for the data interpolation: using the original data, a linear interpolation, and a cubic spline interpolation. The nonparametric approach was the most appropriate: a cubic spline is a spline constructed of piecewise third-order polynomials which pass through a set of \(m\) control points. It should be mentioned that natural cubic spline interpolation does have a tendency to overfit (Smith, 57-62). This method produced minimal variance and was added as a new numeric variable to the SST data as an interpolation metric. The linear approach did not perform as well, mostly because the data violates linearity.
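
The simpler of the two interpolations can be sketched directly: interior gaps are filled by connecting the nearest observed neighbors with a straight line. A minimal Python illustration (the analysis itself was done in R; `linear_interpolate` is an illustrative helper, not the team's code):

```python
def linear_interpolate(series):
    """Fill None gaps by straight-line interpolation between the
    nearest observed neighbours (interior gaps only; leading or
    trailing gaps are left as-is)."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            filled[i] = filled[a] + frac * (filled[b] - filled[a])
    return filled
```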

Original Data

The variance in the original data is 1.582. We calculated the mean and standard deviation of the sea surface temperatures of all years per month in order to further examine the data.

Month  Mean      SD
Jan    25.45804  1.0178743
Feb    26.25393  0.6038061
Mar    26.97385  0.3676477
Apr    27.38055  0.3222460
May    26.86132  0.6098501
June   26.35696  0.5711306
July   25.61456  0.6493074
Aug    24.83622  0.7230877
Sep    24.69250  0.7643564
Oct    24.80926  0.7890523
Nov    24.78312  1.0351496
Dec    25.05949  1.4309290
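
The per-month summary above can be reproduced by grouping the series by calendar month. A minimal Python sketch (the original computation was done in R; `monthly_stats` is an illustrative helper):

```python
from statistics import mean, stdev

def monthly_stats(records):
    """records: iterable of (month, sst) pairs.
    Returns {month: (mean, sample standard deviation)}."""
    by_month = {}
    for month, sst in records:
        by_month.setdefault(month, []).append(sst)
    return {m: (mean(v), stdev(v)) for m, v in sorted(by_month.items())}
```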

Linear Interpolation

The variance after the linear interpolation, 1.542, is slightly lower than the variance of the original data. This interpolation keeps the quartiles approximately the same as the original data. While the mean sea surface temperature per month remains approximately the same, the standard deviation increased for most months.

Month  Mean      SD
Jan    25.51151  1.0873482
Feb    26.22781  0.6355249
Mar    26.90738  0.4354900
Apr    27.25372  0.4750281
May    26.86164  0.5446654
June   26.31829  0.5516208
July   25.56521  0.5913527
Aug    24.98372  0.7400923
Sep    24.74205  0.7957403
Oct    24.79395  0.9244672
Nov    24.85823  1.2203654
Dec    25.07110  1.3724462

Cubic Spline Interpolation

The variance after the cubic spline interpolation, 1.605, is larger than the variance of the original data. The quartiles are just barely lower than in the original data. The standard deviation of sea surface temperatures per month is larger than the original data for most months. It is also slightly larger than the linear interpolation in the winter months and about the same in the summer months.

Month  Mean      SD
Jan    25.49712  1.1043799
Feb    26.23090  0.6346559
Mar    26.96230  0.3974344
Apr    27.31279  0.4197625
May    26.91878  0.5503460
June   26.32420  0.5552533
July   25.54950  0.6129347
Aug    24.91547  0.7182987
Sep    24.70874  0.7933589
Oct    24.77165  0.9502289
Nov    24.82955  1.2543366
Dec    25.05953  1.3796583

Analysis of Missing Data Interpolations

Squared Error

As explained above, the cubic spline interpolation should yield the best results since the data is not linear. Once we received the complete data set, we analyzed the accuracy of our interpolations. The mean squared errors of our linear interpolation and cubic spline interpolation are 0.0254 and 0.0144 respectively, confirming our expectation that the cubic spline interpolation is more accurate.
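
The accuracy check amounts to a mean squared error restricted to the positions that were originally missing. A small Python sketch (illustrative, not the team's R code; `mse_at` is a hypothetical helper):

```python
def mse_at(truth, estimate, missing_idx):
    """Mean squared error computed only over the originally
    missing positions, now that the complete data is available."""
    errors = [(truth[i] - estimate[i]) ** 2 for i in missing_idx]
    return sum(errors) / len(errors)
```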

Preliminary Exploratory Data Analysis

As previously stated, our team wanted to explore seasonal and time trends that may or may not have been present in the data once all wrangling and cleaning was complete.

This first plot breaks the overall time series up across the seven decades. The most inconsistent decade is the 1980s, where we see high averages in comparison to the other decades as well as wider, more variable monthly averages. Likewise, the 1990s plot also has notably higher SSTs, especially compared to the decade that follows. We see much less variation in the later decades of the SST data, which could be due to confounding factors such as climate change, but that is beyond the scope of this report.

These two graphics show the fluctuation of SST over time. The first plot (pictured above) shows the time series of the cubic spline interpolation over time, with month 0 being the first data point (January 1950). The second (pictured below) is a cluster of lines, each one representing a different year as a function of month. The 65 yearly lines are all plotted against each other, and an overall sinusoidal pattern is evident.

These boxplots were constructed to again visualize the temperature distributions across the decade and season variables. In the first plot, illustrating the decade data, it is interesting that the medians are consistent over the roughly 60-year span, contrary to what one might have hypothesized given rising sea temperatures due to global warming. In the second boxplot figure, it is easy to see that spring overall has the warmest sea surface temperatures; given that this data was collected in the southern hemisphere, this is to be expected. Interestingly, there seems to be little difference in the temperature distribution between the summer and winter months; they are practically identical.

Differencing the Data

Time series datasets may contain trends and seasonality, which may need to be removed prior to modeling. Trends can result in a varying mean over time, whereas seasonality can result in a changing variance over time; both make a time series non-stationary. Stationary datasets are those that have a stable mean and variance, and are in turn much easier to model. Differencing is a widely used transform for making time series data stationary.

When modeling, there are assumptions that the summary statistics of observations are consistent over time. In time series terminology, we refer to this expectation as the series being stationary. These assumptions can easily be violated by the addition of a trend, seasonality, or other time-dependent structure; the observations in a stationary time series are not dependent on time.

Our team first checked whether the time series is stationary by looking at a line plot of the series over time: signs of obvious trends, seasonality, or other systematic structure indicate a non-stationary series. A more rigorous method is a statistical test such as the augmented Dickey-Fuller test, which we performed on our data using the adf.test() R function. It determines whether a unit root, a feature that can cause issues in statistical inference, is present in a time series sample. The more negative the test statistic, the stronger the evidence for stationarity. In our results, we have a test statistic of -5.128 and a small p-value of 0.01, indicating significance.

One of the time series models we hoped to examine was the Auto-Regressive (AR) Moving Average (MA) model, in particular the ARIMA (p,d,q) model. In order to construct it we needed to determine the model's notation by calculating the autocorrelation and partial autocorrelation. These results are displayed in correlograms, which indicate how the data is related to itself over time based on the number of periods apart, or lags. The autocorrelation function (ACF) displays the correlation between the series and its lags, informing the moving average order (q) of the ARIMA model, while the partial autocorrelation function (PACF) displays the same correlations with intermediate lags controlled for, informing the auto-regressive order (p).
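
The sample autocorrelation behind an ACF correlogram can be sketched as follows (Python; a textbook estimator for illustration, not R's acf() implementation):

```python
def acf(x, max_lag):
    """Sample autocorrelation r_k = c_k / c_0 for lags 0..max_lag,
    where c_k is the lag-k sample autocovariance."""
    n = len(x)
    mu = sum(x) / n
    c0 = sum((v - mu) ** 2 for v in x) / n
    return [sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n / c0
            for k in range(max_lag + 1)]
```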

In order to carry out these statistical procedures, we decomposed the time series and ran an ACF and PACF on the series’ seasonal components. We decomposed the time series into seasonal, trend, and irregular components using moving averages. Using the decompose() function, an additive model was calculated as \[Y_t = T_t+S_t+e_t\] The trend component is estimated with a moving average and then removed from the series. The seasonal component is then found by averaging the sea surface temperature of each month over all years. Lastly, the error component is calculated by removing the trend and seasonal components from the original time series (RDocumentation).
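
The additive decomposition just described can be sketched in Python (a simplified version of what R's decompose() does, for illustration; for an even period the centered moving average half-weights the window's end points):

```python
def decompose_additive(x, period=12):
    """Mini additive decomposition Y_t = T_t + S_t + e_t.
    Trend: centered moving average; seasonal: mean detrended value
    per position in the cycle; error: the remainder."""
    n = len(x)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        # Centered MA over period+1 points, half-weighting the ends.
        window = x[t - half:t + half + 1]
        trend[t] = (window[0] / 2 + sum(window[1:-1]) + window[-1] / 2) / period
    detrended = {}
    for t in range(n):
        if trend[t] is not None:
            detrended.setdefault(t % period, []).append(x[t] - trend[t])
    means = {m: sum(v) / len(v) for m, v in detrended.items()}
    grand = sum(means.values()) / period
    seasonal = [means[t % period] - grand for t in range(n)]
    error = [x[t] - trend[t] - seasonal[t] if trend[t] is not None else None
             for t in range(n)]
    return trend, seasonal, error
```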

We also differenced the original time series (with a time lag of 12 for each month) and ran the data through the same functions.
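
Seasonal differencing at lag 12 is itself a one-liner; a Python sketch for illustration:

```python
def seasonal_diff(x, lag=12):
    """Return x_t - x_{t-lag}: removes a stable yearly cycle
    from monthly data."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]
```

A perfectly periodic series differences to all zeros, which is the "stationary noise" picture described above.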

## 
##  Augmented Dickey-Fuller Test
## 
## data:  series
## Dickey-Fuller = -5.1282, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

The blue dashed lines mark the significance threshold for the correlations; the lags whose values cross these lines determine the order notation of the ARIMA model for each dataset.


Modeling Process

A Random Walk, Linear Model

The first model we included in our analysis was a benchmark, to serve as a criterion before continuing with more sophisticated models. The first candidate we looked into was a Random Walk (RW) model. A random walk can be expressed by the following:

\[x_t = x_{t-1} + \omega_t\]

The time series can be predicted as a stochastic model with time dependency, based entirely on the previous time point \(t-1\). A random walk time series is not stationary (its AR polynomial has a unit root). Applying first differencing would result in a white noise time series, which is stationary and has minimal autocorrelation.

\[\triangledown x_t = x_t - x_{t-1} = \omega_t\]

Since we are in particular working with cyclical data, hypothetically a seasonal random walk is a plausible model to apply. If the seasonal difference (the season-to-season change) of a time series looks like stationary noise, this suggests that the mean (constant) forecasting model should be applied to the seasonal difference. For monthly data, whose seasonal period is 12, the seasonal difference at period \(t\) is \(X(t)-X(t-12)\). Applying the mean model to this series yields the equation:

\[X(t) - X(t-12) = \alpha\]

This forecasting model will be called the seasonal random walk model, because it assumes that each season’s values form an independent random walk. Thus, the model assumes that January’s temperature for 2020 is a random step away from January’s temperature for 2019, February’s temperature for 2020 is a random step away from February’s temperature for 2019, etc., and the mean temperature of every step is equal to the same constant (denoted here as alpha). The forecast for Jan 2020 ignores all data after Jan 2019; it is based entirely on what happened exactly one year ago.
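
The seasonal random walk forecast follows directly from this definition: alpha is estimated as the mean seasonal difference, and each forecasted month is a step of size alpha from the same month one year earlier. A Python sketch (illustrative; the report's models were fit in R):

```python
def seasonal_rw_forecast(x, steps, lag=12):
    """Forecast via X(t) = X(t - lag) + alpha, where alpha is
    estimated as the mean of the observed seasonal differences."""
    diffs = [x[t] - x[t - lag] for t in range(lag, len(x))]
    alpha = sum(diffs) / len(diffs)
    extended = list(x)
    for _ in range(steps):
        extended.append(extended[-lag] + alpha)
    return extended[len(x):]
```

For a perfectly periodic history, alpha is zero and the forecast simply repeats the last observed cycle.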

Because it is universally understood and simple to apply, the official benchmark model we constructed was a linear regression model. As shown by the plots below, the linear model does not fit the data well: instead of a seasonal pattern, the forecast is a slight, constant increase. There is a very wide range of residuals, from -3 to 3; however, the plot also shows homogeneity of error variance.

## 
## Call:
## lm(formula = SST ~ as.numeric(DATE), data = sst.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9555 -1.0079  0.0255  0.9472  3.3519 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.620332   4.797942   3.672 0.000257 ***
## as.numeric(DATE)  0.004108   0.002420   1.698 0.089986 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.268 on 778 degrees of freedom
## Multiple R-squared:  0.00369,    Adjusted R-squared:  0.00241 
## F-statistic: 2.882 on 1 and 778 DF,  p-value: 0.08999

Linear Model Analysis

The mean squared error of the linear model forecast is 2.065. The plot below shows the linear model forecast (blue line) and the actual sea surface temperature values as points. Because it does not seem to fit the data well, our team continued on to the traditional time series model.
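
The benchmark fit itself is just ordinary least squares on a time index. Sketched in Python (the report's actual fit used R's lm(); `ols_line` is an illustrative helper):

```python
def ols_line(t, y):
    """Least-squares fit of y ~ a + b*t, the benchmark linear model."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    b = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
         / sum((ti - t_bar) ** 2 for ti in t))
    return y_bar - b * t_bar, b  # (intercept, slope)
```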

The Holt-Winters Model

The second type of model we explored was a traditional time series model. For this, we decided the Holt-Winters forecast was most applicable to our data; Holt (1957) developed the method and Winters (1960) extended it to capture seasonality. The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations, one each for the level \(l_t\), the trend \(b_t\), and the seasonal component \(s_t\), with corresponding smoothing parameters \(\alpha\), \(\beta\), and \(\gamma\); \(m\) denotes the period of the seasonality. There are two variations of this model, the additive and multiplicative methods. The additive method is favored when the seasonal variations are roughly constant through the time series; the multiplicative method is favored when the seasonal variations change in proportion to the level of the series. For the purposes of this analysis we fit both, but forecasted the time series using the additive method. While there did not appear to be a significant difference between the two fits, the additive was deemed more appropriate since we are working with seasonal data whose variation is roughly constant. The sum of squared errors for the additive model is 99.797, compared to 102.742 for the multiplicative model.
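
The additive recursions can be sketched as follows (Python; a simplified illustration with naive first-cycle initialization, not R's HoltWinters()):

```python
def holt_winters_additive(x, period, alpha, beta, gamma):
    """Additive Holt-Winters smoothing: level, trend, and seasonal
    components updated each step; returns one-step-ahead fitted
    values for t = period .. len(x)-1."""
    # Naive initialisation from the first two cycles.
    level = sum(x[:period]) / period
    trend = (sum(x[period:2 * period]) - sum(x[:period])) / period ** 2
    seas = [x[i] - level for i in range(period)]
    fitted = []
    for t in range(period, len(x)):
        s = seas[t % period]
        fitted.append(level + trend + s)  # forecast of x[t]
        prev_level = level
        level = alpha * (x[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seas[t % period] = gamma * (x[t] - level) + (1 - gamma) * s
    return fitted
```

On a series with a stable trend and a constant seasonal pattern, the one-step forecasts track the data closely after the components burn in.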

Holt-Winters Analysis

Using the additive model, we forecasted the sea surface temperature for the next 5 years. The plots below show the forecasted temperatures. The second plot compares the Holt-Winters forecast (blue) to the actual sea surface temperature (red). The mean squared error of the forecast is 1.066. Prediction intervals for the Holt-Winters model are calculated without any underlying assumptions about the true model. The k-step-ahead error, \(e_t(k)\), is normally distributed, so a \(100(1-\alpha)\%\) prediction interval for \(X_{t+k}\), where \(\alpha\) is the significance level, is given by: \[\hat{X_t}(k) \pm Z_{\alpha /2} \sqrt{Var(e_t(k))}\] where \(Z_{\alpha /2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution. For a more in-depth analysis, see “Prediction Intervals for the Holt-Winters Forecasting Procedure” by Chatfield and Yar.

Functional Model

The last model we explored was a functional model. Here we considered the data as representing a time series of curves, in particular sinusoidal curves, as expected from seasonal, cyclical fluctuations. We can use this information to specify, for each year 1950-2020, a functional form for the annual cycle. The function we found to best fit the data is as follows (where \(t_{ij}\) denotes the temperature in year \(i\) and month \(j\), \(y_i\) is the year, and \(m_{ij}\) indicates the month):

\[t_{ij} = y_i + \sin(2\pi m_{ij}/12 - 0.6) + \cos(10\, y_i)\]

The plots above aim to show the generic functional pattern of the data. The first plot shows a sinusoidal curve each year, illustrating how the sea surface temperature changes monthly, with color changing over the years. The second and third plots aim to show the yearly pattern changes. The relationship in this case is much less obvious, but we can generally see the time between peaks and troughs, motivating a sine term for the yearly component.

## 
## Call:
## lm(formula = SST ~ YEAR + sin(2 * pi * MONTH/12 - 0.6) + cos(YEAR * 
##     10), data = sst.dat)
## 
## Coefficients:
##                  (Intercept)                          YEAR  
##                    16.567381                      0.004639  
## sin(2 * pi * MONTH/12 - 0.6)                cos(YEAR * 10)  
##                     1.266221                     -0.134710
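
Plugging the printed coefficients back in gives a closed-form predictor. A Python sketch that simply evaluates the fitted equation (coefficients copied from the lm() output above; YEAR is the calendar year and MONTH the month number, as in the R data):

```python
from math import sin, cos, pi

def sst_functional(year, month):
    """Evaluate the fitted functional model using the coefficients
    reported by lm() above."""
    return (16.567381
            + 0.004639 * year
            + 1.266221 * sin(2 * pi * month / 12 - 0.6)
            - 0.134710 * cos(year * 10))
```

The sine coefficient of about 1.27 implies a seasonal peak-to-trough swing of roughly 2.5 degrees Celsius, consistent with the monthly summary tables earlier in the report.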

                      Linear     Holt-Winters  Functional
Mean Squared Error     2.049878   1.056047      1.313298
Absolute Error        74.877035  50.063925     51.177666

Discussion and Conclusion

The overall goal of this project was to explore modern methodologies for dealing with missing data as well as forecasting techniques. This included imputing missing data via interpolation. Overall, we analyzed the performance of three statistical models: a benchmark (linear) model, a Holt-Winters time series model, and a functional model. The table above compares the mean squared error and the absolute error between these three models. It can be clearly observed that the linear model was not an appropriate model to use in this case: it produced the highest error and did not fit the series. However, it provided us with a minimal (“worst-case”) method of forecasting sea surface temperatures. More work and research went into the next two models. Our team worked to learn and apply two new statistical techniques, and it is evident that the effort paid off. Both the Holt-Winters model and the functional model fit the series very well, especially the latter half of the series. The functional model produced a higher MSE, but only marginally (~0.257 larger), and likewise a marginally higher absolute error (~1.114 larger).

From this project, one of the biggest take-aways was researching and applying these new techniques with minimal formal instruction. From the results, we can see that the Holt-Winters model fit the series best, and the complex steps that went into constructing the functional model may not have been as advantageous in producing a better fit; however, this allowed our team to go through the process of developing and comparing these models.

Works Cited

Chatfield, Chris, and Mohammed Yar. “Prediction Intervals for the Holt-Winters Forecasting Procedure.” International Journal of Forecasting, vol. 6, 1990, pp. 127–137.

“Decompose.” RDocumentation.

Holt, C. E. (1957). Forecasting seasonals and trends by exponentially weighted averages (O.N.R. Memorandum No. 52). Carnegie Institute of Technology, Pittsburgh USA.

Smith, Patricia L. “Splines As a Useful and Convenient Statistical Tool.” The American Statistician, vol. 33, no. 2, 1979, pp. 57–62. JSTOR, Accessed 29 Apr. 2020.

Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science, 6, 324–342.